A fast continuous time approach with time scaling for nonsmooth convex optimization

In a Hilbert setting, we study the convergence properties of the second order in time dynamical system combining viscous and Hessian-driven damping with time scaling in relation to the minimization of a nonsmooth and convex function. The system is formulated in terms of the gradient of the Moreau envelope of the objective function with a time-dependent parameter. We show fast convergence rates for the Moreau envelope, its gradient along the trajectory, and also for the system velocity. From here, we derive fast convergence rates for the objective function along a path which is the image of the trajectory of the system through the proximal operator of the objective function. Moreover, we prove the weak convergence of the trajectory of the system to a global minimizer of the objective function. Finally, we provide multiple numerical examples illustrating the theoretical results.


Introduction
Let $H$ be a real Hilbert space endowed with the scalar product $\langle \cdot, \cdot \rangle$ and norm $\|x\| = \sqrt{\langle x, x \rangle}$ for $x \in H$. In connection with the minimization problem
\[ \min_{x \in H} \Phi(x), \]
we will study the asymptotic behavior of the second order in time evolution equation
\[ \ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \beta(t)\frac{d}{dt}\nabla\Phi_{\lambda(t)}(x(t)) + b(t)\nabla\Phi_{\lambda(t)}(x(t)) = 0 \quad \text{for } t \ge t_0, \tag{1} \]
with initial conditions $x(t_0) = x_0 \in H$, $\dot{x}(t_0) = u_0 \in H$, where $\alpha \ge 1$, $t_0 > 0$, and $\beta : [t_0, +\infty) \to [0, +\infty)$ and $b, \lambda : [t_0, +\infty) \to (0, +\infty)$ are differentiable functions. We assume that $\Phi : H \to \overline{\mathbb{R}} = \mathbb{R} \cup \{\pm\infty\}$ is a proper, convex and lower semicontinuous function and denote by $\Phi_\lambda : H \to \mathbb{R}$ its Moreau envelope of parameter $\lambda > 0$. In addition, we assume that $\operatorname{argmin} \Phi$, the set of global minimizers of $\Phi$, is not empty and denote by $\Phi^*$ the optimal objective value of $\Phi$.
Our aim is to derive rates of convergence for the Moreau envelope of the objective function and the objective function itself to $\Phi^*$, as well as for the gradient of the Moreau envelope of the objective function and the velocity of the trajectory to zero in terms of the Moreau parameter function $\lambda$ and the time scaling function $b$. In addition, we will provide a setting that also guarantees the weak convergence of the trajectory of the dynamical system to a minimizer of $\Phi$. The theoretical results will be illustrated by multiple numerical experiments.

Historical remarks
Inertial dynamics were introduced by Polyak in [23] in the form of the so-called heavy ball with friction method with fixed viscous coefficient $\alpha > 0$, to accelerate the gradient method for the minimization of a continuously differentiable function $\Phi : H \to \mathbb{R}$. This system was later studied by Alvarez-Attouch [3,4] and by Attouch-Goudou-Redont [11]. In these works, for a convex function $\Phi$, an asymptotic convergence rate of $\Phi(x(t))$ to $\Phi^*$ of order $O(1/t)$ as $t \to +\infty$, as well as an improvement for a strongly convex function to an exponential rate of convergence, was proved. The weak convergence of the trajectories to a minimizer of $\Phi$ was also established.
A major step to obtain faster asymptotic convergence in the convex regime was done by Su-Boyd-Candes [24], by considering in the second order dynamical system an asymptotically vanishing damping coefficient,
\[ \ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \nabla\Phi(x(t)) = 0 \tag{2} \]
for $t \ge t_0$ and $\alpha \ge 3$. Second order dynamical systems with variable and vanishing damping coefficients for optimization were studied, for instance, in [17][18][19]. The system (2) corresponds to a continuous version of Nesterov's accelerated gradient method [21]. For the function values, rates of convergence of
\[ \Phi(x(t)) - \Phi^* = O\left(\frac{1}{t^2}\right) \quad \text{as } t \to +\infty \]
were obtained. For $\alpha > 3$, in [9], it was shown that the trajectory of (2) converges weakly to an element of $\operatorname{argmin} \Phi$, and in [13,20], the asymptotic convergence rate of the function values was improved to $o(1/t^2)$ as $t \to +\infty$. The following system, combining asymptotic vanishing damping with the Hessian-driven damping, was proposed by Attouch-Peypouquet-Redont in [15]
\[ \ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \beta\nabla^2\Phi(x(t))\dot{x}(t) + \nabla\Phi(x(t)) = 0 \tag{3} \]
for $t \ge t_0$, where $\Phi : H \to \mathbb{R}$ is twice continuously differentiable and convex, $\alpha \ge 3$ and $\beta \ge 0$. The Hessian-driven damping has a natural link with Newton's method and gives rise to dynamical inertial Newton systems [1]. The system (3) preserves the convergence properties of (2), while having, for $\beta > 0$, other important features. In addition, possible oscillations exhibited by the solutions of (2) are neutralized by (3).

Time scaling
Time scaling of the dynamical system (2) was used to accelerate the rate of convergence of the values of the function $\Phi$ along the trajectory. Through time scaling, the system (2) becomes a dynamical system of the form
\[ \ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + b(t)\nabla\Phi(x(t)) = 0, \tag{4} \]
where $\alpha \ge 3$, and $b : [t_0, +\infty) \to (0, +\infty)$ is a continuous scalar function, as it was introduced and studied by Attouch-Chbani-Riahi in [10]. For (4), it was shown that
\[ \Phi(x(t)) - \Phi^* = O\left(\frac{1}{t^2 b(t)}\right) \quad \text{as } t \to +\infty, \]
and that the convergence rate can be improved to $o\left(\frac{1}{t^2 b(t)}\right)$ as $t \to +\infty$, if $\alpha > 3$. In [7,8] (see also [5]), the dynamical system
\[ \ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \beta(t)\nabla^2\Phi(x(t))\dot{x}(t) + b(t)\nabla\Phi(x(t)) = 0, \tag{5} \]
which combines viscous and Hessian-driven damping with time scaling, where $\alpha \ge 1$ and $\beta, b : [t_0, +\infty) \to (0, +\infty)$ are functions with appropriate differentiability properties, was investigated. A quite general setting formulated in terms of the dynamical system parameter functions was identified in which the properties of (5) concerning the convergence of the function values are preserved, while the gradient of $\Phi$ converges strongly to zero along the trajectory, and the trajectory converges weakly to a minimizer of the objective function. In [7,8], a numerical algorithm obtained via time discretization of (5) was studied, exhibiting convergence properties analogous to those of the dynamical system.

Nonsmooth optimization
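The Moreau envelope of a proper, convex and lower semicontinuous function $\Phi : H \to \overline{\mathbb{R}}$ has played a significant role in the literature when designing continuous-time approaches and numerical algorithms for the minimization of $\Phi$. This is defined as
\[ \Phi_\lambda : H \to \mathbb{R}, \quad \Phi_\lambda(x) = \inf_{y \in H}\left\{\Phi(y) + \frac{1}{2\lambda}\|x - y\|^2\right\}, \]
where $\lambda > 0$ is called the parameter of the Moreau envelope (see, for instance, [16]). For every $\lambda > 0$, the functions $\Phi$ and $\Phi_\lambda$ share the same optimal objective value and the same set of minimizers. In addition, $\Phi_\lambda$ is convex and continuously differentiable with
\[ \nabla\Phi_\lambda(x) = \frac{1}{\lambda}\left(x - \operatorname{prox}_{\lambda\Phi}(x)\right) \quad \text{for every } x \in H, \tag{6} \]
and $\nabla\Phi_\lambda$ is $\frac{1}{\lambda}$-Lipschitz continuous. Here,
\[ \operatorname{prox}_{\lambda\Phi}(x) = \operatorname*{argmin}_{y \in H}\left\{\Phi(y) + \frac{1}{2\lambda}\|x - y\|^2\right\} \]
denotes the proximal operator of $\Phi$ of parameter $\lambda$. For every $x \in H$ and $\lambda, \mu > 0$, we have On the other hand, for every $x \in H$, the function $\lambda \in (0, +\infty) \mapsto \Phi_\lambda(x)$ is nonincreasing and differentiable (see, for instance, [6, Lemma A.1]), namely,
\[ \frac{d}{d\lambda}\Phi_\lambda(x) = -\frac{1}{2}\|\nabla\Phi_\lambda(x)\|^2 \quad \text{for every } \lambda > 0. \]
Attouch-Cabot considered in [6] (see also [14] for a more general approach for monotone inclusions), in connection with the minimization of the proper, convex and lower semicontinuous function $\Phi : H \to \overline{\mathbb{R}}$, the following second order differential equation
\[ \ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \nabla\Phi_{\lambda(t)}(x(t)) = 0 \tag{8} \]
for $t \ge t_0$, where $\alpha \ge 1$, and $\lambda : [t_0, +\infty) \to (0, +\infty)$ is continuously differentiable and nondecreasing. Convergence rates for the values of the Moreau envelope, as well as for the velocity of the system, were obtained, from which convergence rates for $\Phi$ along $x(t)$ were deduced. In addition, the weak convergence of the trajectories $x(t)$ to a minimizer of $\Phi$ as $t \to +\infty$ was established.
Attouch-László considered in [12] in the same context the dynamical system
\[ \ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + \beta\frac{d}{dt}\nabla\Phi_{\lambda(t)}(x(t)) + \nabla\Phi_{\lambda(t)}(x(t)) = 0, \tag{9} \]
where $\alpha > 1$ and $\beta > 0$. The term $\frac{d}{dt}\nabla\Phi_{\lambda(t)}(x(t))$ is inspired by the Hessian-driven damping, and its existence is justified almost everywhere since the mapping $t \mapsto \nabla\Phi_{\lambda(t)}(x(t))$ is locally absolutely continuous (see, for example, [12, Lemma 1]). It was shown that for $\lambda(t) = \lambda t^2$, where $\lambda > 0$, the system (9) inherits all major convergence properties of (8), and, in addition, convergence rates for the gradient of the Moreau envelope of parameter $\lambda(t)$ and its time derivative along $x(t)$ were established.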
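As a concrete illustration of the Moreau envelope and of the standard identity $\nabla\Phi_\lambda(x) = (x - \operatorname{prox}_{\lambda\Phi}(x))/\lambda$, the following minimal sketch treats the one-dimensional test function $\Phi(x) = |x|$, for which the proximal operator is the soft-thresholding map. The function names and the grid-based sanity check are illustrative choices made here and are not taken from the paper.

```python
def prox_abs(x, lam):
    """Proximal operator of lam*|.| at x (soft-thresholding)."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def moreau_env_abs(x, lam):
    """Moreau envelope of |.| with parameter lam, evaluated at the prox point."""
    p = prox_abs(x, lam)
    return abs(p) + (x - p) ** 2 / (2.0 * lam)


def moreau_grad_abs(x, lam):
    """Gradient of the Moreau envelope via (x - prox(x)) / lam."""
    return (x - prox_abs(x, lam)) / lam


def brute_force_env(x, lam, lo=-10.0, hi=10.0, n=200001):
    """Approximate the infimum defining the envelope on a uniform grid (sanity check only)."""
    step = (hi - lo) / (n - 1)
    best = float("inf")
    for i in range(n):
        y = lo + i * step
        best = min(best, abs(y) + (x - y) ** 2 / (2.0 * lam))
    return best
```

For instance, `moreau_env_abs(3.0, 1.0)` can be compared against `brute_force_env(3.0, 1.0)` to confirm that the closed form agrees with the definition up to the grid resolution.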

Our contribution
In this paper, we derive a setting formulated in terms of $\alpha \ge 1$ and the parameter functions $\beta$, $b$ and $\lambda$ of the dynamical system (1) associated with the minimization of the proper, convex and lower semicontinuous function $\Phi : H \to \overline{\mathbb{R}}$, which allows us to prove
• convergence rates for the Moreau envelope, its gradient, and the velocity of the trajectory as $t \to +\infty$;
• convergence rates for the objective function as $t \to +\infty$;
• the weak convergence of the trajectory $x(t)$ to a minimizer of $\Phi$ as $t \to +\infty$.
In addition, we provide a particular formulation of the derived general setting for the case when the parameter functions are chosen to be polynomials and illustrate the influence of the latter on the convergence behavior of the dynamical system by multiple numerical experiments.

Existence and uniqueness of strong global solution
This section is devoted to the existence and uniqueness of a strong global solution of the system (1). To this aim, we will rewrite (1) as a first order in time system in the product space $H \times H$.
First, we assume that β : After multiplying the first line by b(t) -β(t) and the second one by β(t) and then summing them, we get rid of the gradient of the Moreau envelope in the second equation, and after simplification, we obtain for the dynamical system the following equivalent formulation In case $\beta(t) = 0$ for every $t \ge t_0$, (1) can be equivalently written as Based on the two reformulations of the dynamical system (1), we can formulate the following existence and uniqueness result, which is a consequence of the Cauchy-Lipschitz theorem for strong global solutions. The result can be proved along the lines of the proofs of Theorem 1 in [12] or of Theorem 1.1 in [15], with some small adjustments.

Energy function and rates of convergence for function values
In this section, we will define an energy function for the dynamical system (1) and investigate its dissipativity properties. These will play a crucial role in the derivation of rates of convergence for the Moreau envelope of $\Phi$ and the objective function itself.
To shorten the calculations, we introduce the auxiliary function (see also [7,8]) For $z \in \operatorname{argmin} \Phi$ and consider the energy function $E_c$ : In the following theorem, we formulate sufficient conditions that guarantee the decay of the energy of the dynamical system (1) and discuss some of its consequences.
are satisfied. Then, for a solution $x : [t_0, +\infty) \to H$ to (1), the following statements are true: Moreover, assuming that $\alpha > 1$ and that where we used that Using (1) to replace $\ddot{x}(t)$, we may write the third summand in the formulation of $\dot{E}_c(t)$ for every $t \ge t_0$ as Notice that the terms with $\langle \nabla\Phi_{\lambda(t)}(x(t)), \dot{x}(t) \rangle$ cancel each other; thus, after simplification, we obtain for every $t \ge t_0$ Thanks to (11), $w(t)$ is positive for every $t \ge t_0$, thus which leads to By (10) and the fact that $\lambda$ is nondecreasing, we deduce that Let us choose $c := \alpha - 1$. According to (12), we obtain for the coefficient of $\Phi_{\lambda(t)}(x(t)) - \Phi^*$ in (17) Therefore, (17) allows us to deduce for every $t \ge t_0$ We have just established that $E_{\alpha-1}$ is nonincreasing, which for every $t \ge t_0$ leads to From here, we obtain for every $t \ge t_0$ which proves (ii). Moreover, by integration over $[t_0, +\infty)$, we obtain the two integral estimates which are the claims (iii) and (iv).
From now on, we assume that $\alpha > 1$ and choose $c := \alpha - 1 - \varepsilon$, where $\varepsilon$ is given by (13). In this setting, (16) reads for every $t \ge t_0$, So, under the condition (13), $\dot{E}_{\alpha-1-\varepsilon}(t) \le 0$ for every $t \ge t_0$. Integrating (21), we obtain which gives the claim (v). Since the energy function is bounded from above and nonnegative on $[t_0, +\infty)$, it follows that the trajectory $x$ is bounded, which is item (vi). Finally, from (13) and (20), we deduce the claim (vii), which finishes the proof.
The following auxiliary result will be needed later. (13)

Lemma 3 Suppose that α > 1 and
Proof Recall that according to (15), we have for every $t \ge t_0$ We choose again $c := \alpha - 1$ and split the term $(\alpha - 1)\,t\,w(t)\,\langle x(t) - z, \nabla\Phi_{\lambda(t)}(x(t)) \rangle$ into the sum of two expressed in terms of $\varepsilon$ given by (13). For every $t \ge t_0$, we have By applying the convex subdifferential inequality, we obtain for every $t \ge t_0$ Since $\beta$ is nondecreasing, for every $t \ge t_0$, it holds thus, (13) leads to $t^2\dot{w}(t) - (\alpha - 3 - \varepsilon)\,t\,w(t) \le 0$. Consequently, we obtain from (25) by integration Now, we are in a position to improve the convergence rates we obtained previously in (18) and derive from here convergence rates for $\Phi$. (13) holds, that $\lambda$ and $\beta$ are nondecreasing on $[t_0, +\infty)$, and that (11) holds. In addition, assume that

Theorem 4 Suppose that α > 1 and
where $[\cdot]_+$ denotes the positive part of the expression inside the brackets, and that there exists $C > 0$ such that Then, for a solution $x : [t_0, +\infty) \to H$ to (1), it holds and Moreover, and and Proof First, we notice that for every $t \ge t_0$, it holds For every $h > 0$, by the monotonicity of the gradient of a convex function, we have so letting $h$ tend to zero, we obtain Consequently, for every $t \ge t_0$, it holds where we used (6), (7), and the Cauchy-Schwarz inequality. Now, we multiply (1) by $t^2\dot{x}(t)$ to deduce, using the inequality above and (14), for every $t \ge t_0$ Using (27), we obtain for every $t \ge t_0$ Next, we show the integrability of the right-hand side of the expression above. The first term is integrable according to Theorem 2 (v), and the second one is integrable according to Theorem 2 (vii). Further, since and taking into account the boundedness of the trajectory $x$ established in Theorem 2 (vi) and that $z \in \operatorname{argmin} \Phi$, we deduce as $t \to +\infty$. So, under the assumption (26), we obtain that there exists $C > 0$ such that for every $t \ge t_0$ Applying Lemma 6 in the Appendix, we conclude that the following limit exists. We will show that $L = 0$. Supposing that $L > 0$, we deduce that there exists $t^* \ge t_0$ such that for every Integrating the last inequality on $[t^*, +\infty)$, we arrive at a contradiction with the integrability of the left-hand side as proved in Theorem 2 (v) and (vii). Therefore, $L = 0$, and we obtain and Using the definition of the proximal mapping, we derive which yields and According to (6), we obtain from here as $t \to +\infty$.

Convergence of the trajectories
In this section, we will investigate the weak convergence of the trajectory $x$ to a minimizer of $\Phi$.
Applying now Lemma 7 in the Appendix, we immediately get the existence of the limit of $q(t)$ as $t \to +\infty$. By the definition of $q$ and (35), we establish the first statement of Opial's Lemma (see Lemma 8 in the Appendix), namely, that, for any $z \in \operatorname{argmin} \Phi$, $\lim_{t \to +\infty} \|x(t) - z\|$ exists.
Considering a sequence $\{t_k\}_{k \in \mathbb{N}}$ such that $\{x(t_k)\}_{k \in \mathbb{N}}$ converges weakly to an element $z \in H$ as $k \to +\infty$, we notice that $\{\xi(t_k)\}_{k \in \mathbb{N}}$ converges weakly to $z$ as $k \to +\infty$. Now, the function $\Phi$ being convex and lower semicontinuous in the weak topology allows us to write Hence, $z \in \operatorname{argmin} \Phi$, and the second statement of Opial's Lemma is shown. This gives the weak convergence of the trajectory $x(t)$ to a minimizer of $\Phi$ as $t \to +\infty$.
Remark 1 In the hypotheses of Theorem 4, to obtain the convergence of the trajectories, besides (33), it is enough to assume that to guarantee (35). Indeed, in this case, (34) follows from the conclusion of Theorem 4.
Remark 2 (Implicit discretization) Implicit discretization of the dynamical system (1) with fixed step size $h > 0$ leads to the numerical scheme that reads for every $k \ge 1$ (see also [7,8]) where $x_k$, $\lambda_k$, $\beta_k$, and $b_k$ denote $x(kh)$, $\lambda(kh)$, $\beta(kh)$, and $b(kh)$, respectively. Rearranging the terms, one obtains for every $k \ge 1$ Relation (6), namely, and the property of the proximal mapping (see, for instance, [16]) lead to the following formulation of the implicit numerical algorithm where $x_0, x_1 \in H$ are given starting points.

Polynomial choices for the system parameter functions
According to the previous two sections, to guarantee both the fast convergence rates in Theorem 4 and the convergence of the trajectory to a minimizer of $\Phi$ in Theorem 5, also by taking into account Remark 1, it is enough to make the following assumptions on the system parameter functions
(I) $\alpha > 1$, and there exists $\varepsilon \in (0, \alpha - 1)$ such that $(\alpha - 3)w(t) - t\dot{w}(t) \ge \varepsilon b(t)$ for every $t \ge t_0$;
(II) $\beta$ and $\lambda$ are nondecreasing on $[t_0, +\infty)$;
In this section, we will investigate the fulfillment of these conditions for
\[ b(t) = b\,t^n, \quad \lambda(t) = \lambda\,t^l, \quad \beta(t) = \beta\,t^m, \]
where $n, m, l \in \mathbb{R}$, $b, \lambda > 0$ and $\beta \ge 0$. For this choice of $b$, condition (V) is fulfilled. First, we assume that $\beta = 0$. Then the conditions (III), (IV), and (VI) are fulfilled, while the conditions (II) and (VI) are nothing else than $0 \le l \le 1$. Condition (I) asks for $\alpha > 1$ and for the existence of $\varepsilon \in (0, \alpha - 1)$ such that for every $t \ge t_0$, $(\alpha - 3 - n - \varepsilon)\,b\,t^n \ge 0$ or, equivalently, $\alpha - 3 - n \ge \varepsilon$. To this end, it is enough to have that $\alpha - 3 > n$.
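The reduction of condition (I) in the case $\beta = 0$ can be spelled out as the following short computation. It is a sketch under the assumption that the auxiliary function satisfies $w(t) = b(t)$ when $\beta \equiv 0$, which is what the simplified inequality in the text suggests; the general definition of $w$ is not reproduced here.

```latex
% Assumption: beta = 0, so that w(t) = b(t) = b t^n with b > 0.
\begin{align*}
(\alpha - 3)\,w(t) - t\,\dot{w}(t)
  &= (\alpha - 3)\,b\,t^{n} - t \cdot n\,b\,t^{n-1}
   = (\alpha - 3 - n)\,b\,t^{n}, \\
(\alpha - 3)\,w(t) - t\,\dot{w}(t) \ge \varepsilon\,b(t)
  &\iff (\alpha - 3 - n - \varepsilon)\,b\,t^{n} \ge 0
   \iff \alpha - 3 - n \ge \varepsilon .
\end{align*}
```

A positive $\varepsilon$ of this kind exists whenever $\alpha - 3 > n$; for example, $\alpha = 9$ and $n = 5$ admit $\varepsilon = 1$.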

The influence of b on the dynamical behaviour
First, let us choose as objective function $\Phi : \mathbb{R} \to \mathbb{R}_+$, $\Phi(x) = |x|$, fix $m = 0$, $\alpha = 9$ and $l = 1$, and vary $n$.
In Fig. 1, we see that the larger the exponent $n$ of the function $b$, the faster the function values of the Moreau envelope and its gradient converge, starting with the slowest pace for $n = 0$ and accelerating until $n = 4.99$, which is in line with the theoretical convergence rates. In addition, increasing the exponent of $b$ also seems to improve the convergence behavior of the trajectory. Large exponents for $b$ improve the convergence considerably; however, as seen in the previous section, they are limited by the upper bound $\alpha - 3$.
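An experiment of this kind can be sketched as follows. This is not the authors' code: it assumes $\beta \equiv 0$, so that the Hessian-driven term disappears and the system is taken to reduce to $\ddot{x}(t) + \frac{\alpha}{t}\dot{x}(t) + b(t)\nabla\Phi_{\lambda(t)}(x(t)) = 0$, and it further assumes $b(t) = t^n$, $\lambda(t) = t^l$, $t_0 = 1$, $u_0 = 0$ and a classical fixed-step Runge-Kutta integrator, all of which are choices made here for illustration.

```python
def moreau_grad_abs(x, lam):
    """Gradient of the Moreau envelope of |.|: x / lam clipped to [-1, 1]."""
    return max(-1.0, min(1.0, x / lam))


def moreau_env_abs(x, lam):
    """Moreau envelope of |.| with parameter lam (Huber function)."""
    return x * x / (2.0 * lam) if abs(x) <= lam else abs(x) - lam / 2.0


def simulate(alpha=9.0, n=2.0, l=1.0, t0=1.0, T=10.0, x0=1.0, u0=0.0, h=1e-3):
    """Integrate x'' + (alpha/t) x' + t^n * grad Phi_{t^l}(x) = 0 with classical RK4.

    Assumes beta = 0, b(t) = t^n and lambda(t) = t^l (illustrative choices).
    Returns the final time, position and velocity.
    """

    def rhs(t, x, v):
        # First-order form: x' = v, v' = -(alpha/t) v - b(t) * grad Phi_{lambda(t)}(x)
        return v, -(alpha / t) * v - (t ** n) * moreau_grad_abs(x, t ** l)

    t, x, v = t0, x0, u0
    for _ in range(int(round((T - t0) / h))):
        k1x, k1v = rhs(t, x, v)
        k2x, k2v = rhs(t + h / 2, x + h / 2 * k1x, v + h / 2 * k1v)
        k3x, k3v = rhs(t + h / 2, x + h / 2 * k2x, v + h / 2 * k2v)
        k4x, k4v = rhs(t + h, x + h * k3x, v + h * k3v)
        x += h / 6 * (k1x + 2 * k2x + 2 * k3x + k4x)
        v += h / 6 * (k1v + 2 * k2v + 2 * k3v + k4v)
        t += h
    return t, x, v
```

Calling `simulate(n=...)` for several exponents below $\alpha - 3$ and evaluating `moreau_env_abs(x, t ** l)` at the final state gives a rough way to compare the decay of the Moreau envelope values across choices of $n$.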

The influence of λ on the dynamical behaviour
For the same objective function as in the previous subsection, we study the behavior of the dynamics when varying the exponent $l$ to investigate the influence of the function $\lambda$.
To this end, we fix $m = 0$, $\alpha = 9$ and $n = 5 < \alpha - 3$, and take for $l$ three different values from 0 to 1. We also choose the starting point $x_0 = 1$ to provide a better illustration.
One can notice in Fig. 2 that the higher $l$ is, the better the convergence behavior of the function values of the Moreau envelope and of its gradient, whereas, interestingly enough, the opposite phenomenon takes place for the convergence of the trajectories.

The influence of β on the dynamical behaviour
Let $\Phi : \mathbb{R} \to \mathbb{R}_+$, $\Phi(x) = |x| + \frac{x^2}{2}$, $\alpha = 13$, $n = 9 < \alpha - 3$ and $l = 1$. We vary the exponent $m$ such that $2m < n + l$ to study the influence of the function $\beta$ on the convergence behavior of the system. In Fig. 3, we see that, even though $m$ does not explicitly appear in the theoretical convergence rates for the gradient of the Moreau envelope and the trajectory of the system, it influences the convergence behavior of both of them as well as of the function values of the Moreau envelope, in the sense that convergence is faster the higher the value of $m$ is. Finally, we consider two parameter choices which lie outside the convergence setting derived in the previous section, and notice that these fundamentally affect the convergence of the trajectory. In Fig. 4(a), we choose $m$ such that the condition $2m < n + l$ is violated, and in Fig. 4(b), we choose $\alpha$ and $n$ such that the condition $\alpha - 3 > n$ is also violated. One can see that in both settings, the trajectories diverge.
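For the objective $\Phi(x) = |x| + \frac{x^2}{2}$, the proximal operator and hence the gradient of the Moreau envelope are available in closed form, which is what makes such experiments cheap to run. The sketch below derives them from the optimality condition $0 \in \lambda\,\partial|y| + \lambda y + y - x$; the function names are hypothetical and the snippet is an illustration, not the implementation used for the figures.

```python
def prox_abs_plus_half_square(x, lam):
    """Proximal operator of lam * (|y| + y^2 / 2) at x.

    Soft-threshold x at level lam, then shrink by the factor 1 / (1 + lam).
    """
    if x > lam:
        return (x - lam) / (1.0 + lam)
    if x < -lam:
        return (x + lam) / (1.0 + lam)
    return 0.0


def moreau_grad_abs_plus_half_square(x, lam):
    """Gradient of the Moreau envelope via (x - prox(x)) / lam."""
    return (x - prox_abs_plus_half_square(x, lam)) / lam


def prox_objective(y, x, lam):
    """Function minimized by the proximal point: Phi(y) + |x - y|^2 / (2 lam)."""
    return abs(y) + y * y / 2.0 + (x - y) ** 2 / (2.0 * lam)
```

As a quick check, for $x = 3$ and $\lambda = 1$ the proximal point is $1$ and the envelope gradient is $2$, which coincides with the derivative $\operatorname{sign}(y) + y$ of $\Phi$ at $y = 1$.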
• for every $z \in S$, $\lim_{t \to +\infty} \|x(t) - z\|$ exists;
• every weak sequential cluster point of the map $x$ belongs to $S$.
Then, $x(t)$ converges weakly to some element of $S$ as $t \to +\infty$.